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Abstract 

We present a performance comparison of the Kramers equation and the boson 
algorithms for simulations of QCD with two flavors of dynamical Wilson fermions 
and gauge group SU{2). Results are obtained on 6"^12, 8"^12 and lattices. In 
both algorithms a number of optimizations are installed. 
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1 Introduction 



One of the most pressing problems in lattice QCD today concerns numerical simulations of 
dynamical fermions. The CPU-time spent on present computers to generate a statistically 
independent configuration on reasonably sized lattices (say 16'*) easily reaches hours. This 
makes it difficult to obtain sufficient statistics for an accurate determination of relevant physical 
quantities. It is therefore not surprising that the search for new algorithmic techniques or 
improvements on existing algorithms is an active research area. 

In this letter we compare the performance of the Kramers equation [1, 2] and the boson 
algorithms [3, 4, 5], both aimed at simulations of dynamical fermions. The performance of 
an algorithm is the product of the speed of the program, i.e. the CPU-time to generate a 
configuration (not necessarily an independent one) and the autocorrelation time of a given 
observable. Of course, the speed of the program depends on the chosen computer architecture. 
To be complete, one therefore also has to specify the machine on which numerical tests are 
performed. In our case we used the Alenia Quadrics (APE) massively parallel computers. 
Although our results will be presented for this particular machine as an example, we will also 
give a more machine independent measure for the performance in section 3. 

Before starting to compare the two algorithms under consideration, we spent some effort 
to optimize them. These attempts are described for the Kramers equation algorithm in ref. [2] 
and for the boson algorithm in ref. [5]. We direct the interested reader to these references for 
further details. Let us here only summarize these works by mentioning that with relatively 
simple modifications of the algorithms large improvement factors of 0(10) can be obtained. 

We understand this investigation as only one step of gaining experience with the behavior 
of the two algorithms, in particular with the relatively new boson algorithm. It is clear that 
a full QCD simulation is very costly and one should not hope for a similar precision for the 
autocorrelation time as in, say, spin models. There, algorithms can be tested thoroughly with 
several million of configurations which even allow for a determination of the critical dynamical 
exponent [6]. Our aim here is much more moderate. In order to get reliable numbers for 
the autocorrelation times and estimates of their errors, we stay in a situation where we are 
not deep in the physically most interesting however numerically very challenging region of 
chiral symmetry restoration. We hope that enough "statistics" will be gathered by including 
also results from other studies in the future. For a review of the present status of fermion 
algorithms see [7]. 

As in refs. [4, 2, 5], we will study the standard lattice Wilson QCD with gauge group 
SU{2). We will work on a 4-dimensional euclidean space-time lattice with volume = L^^Lf 
and periodic boundary conditions in all four directions. The gauge field U^{x) G SU{2) lives 
on the link pointing from x to x -\- fi^ where fi = 0, 1,2,3 designates the 4 forward directions 
in space-time. The quark fields are denoted by tl^Aaa{x) where A, a and a are fiavor, color and 
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Dirac indices respectively. The full partition function for the model is given by, 

Z = j VUVy^Vt^exp {-Sg - S^) , (1) 
where the gauge action Sg and the Wilson fermion action are given by: 

= J2H^)iD + ^o)i^i^) ■ (2) 

X 

The Wilson difference operator D which appears in the above expression is given by: 

D = ^E[7.(v, + v;)-v,v;] , 

V^V(^) = Uf,{x)tlj{x + fi) - tlj{x) , 

V^ipix) = ip{x) - Ul{x - fi)ip{x - fi) , (3) 

and Up is the usual plaquette term on the lattice. In the following, we will consider two flavors 
of Wilson fermions with degenerate masses. Let us write the fermion matrix M in the hopping 
parameter k, representation, M = 2K,[D-\-mo) with mo the bare quark mass and k, = (8+2mo)~^. 
In order to go over to the bosonic theory later, we introduce the hermitian matrix 

Q = 7sM/ [cm(1 + Sk)] , (4) 

where cm is a free parameter of 0(1) which will be tuned to optimize the bosonic simulation 
algorithm. The parameter cm has to be chosen such that the eigenvalues A of Q satisfy 

< |A| < I. (5) 

As usual, for the simulations using molecular dynamics algorithms, the fermion determinant is 
written in terms of Gaussian scalar fields (f> such that the path integral reads 

Z = j VUV(P^V(Pe-^-ff , 
Seff = Sg + ^HQ)-'^. (6) 

1.1 Transformation to the bosonic theory 

Following ref. [3], the path integral in eq.(6) can be written as a /oca/ bosonic theory. We start 
by approximating the function 1/s in terms of a series of polynomials P„(s) of even degree n 
that satisfy 

lim PJs) = 1/s for all < s < 1. (7) 

n^oo 

In order to apply eq.(7) to detQ^ as needed for the simulations, we consider the matrix-valued 
polynomials Pn{Q^). Because of relation (5) the determinant may be written as 

detg2 = lim \detPjQ^)]~\ (8) 
3 



The polynomial may be factorized by determining its roots Zk^ A: = 1, n 

n 

Pn{Q')^l[ [{Q - IJ^kY + '^l] , (9) 

k=l 

where 

l^k + ii^k = \fzk, Vk > 0. (10) 

A particular example for such roots will be given below. Using eq.(9), the determinant factorizes 
accordingly. Integrating the individual determinants in again by means of scalar fields, the 
partition function becomes 



/n 
VU n V<5kV<5le-^^^^''^^ , (11) 



k=i 
with 

s, [u, $] = s, [u] + EE - ^k)M^)\' + ^imx)\'] . (12) 

X k 

Note that this form of the bosonic expression for the QCD path integral leads to a completely 
local theory in contrast to eq.(6) which is highly non-local due to the appearance of the inverse 
fermion matrix. The price to pay is obviously the introduction of a number n of scalar field 
copies. The choice of the polynomial we will be using in this work is as in [4]. It is a Chebyshev 
polynomial, the roots of which are given by 

1 1 2Trk /- 2Trk 

Zk = - 1 + e - - 1 + e cos^- -zVesm^- . 13 
z z n+l n+l 

It approximates the function 1/s uniformly in the interval e < s < 1. The relative fit error 

Rn{s) = [Pn{s)-l/s]s (14) 

in this interval is exponentially small: 

\Rn{s)\ < 2 i-^^] . (15) 



We define the accuracy parameter 
which gives a measure of how well the chosen polynomial approximates 1 /s in the given interval. 



6 = max |i?„(s)| (16) 

e<s<l 



2 The algorithms 

As mentioned in the introduction, we studied two kinds of algorithms, the Kramers equation 
and the boson algorithm. In the following we discuss the basic ideas of these algorithms and 
list the improvements implemented. 
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2.1 Kramers equation algorithm 

This algorithm has its origin from techniques based on the Langevin equation. It falls into the 
class of molecular dynamics algorithms and is very closely related to the Hybrid Monte Carlo 
algorithm [8, 2]. When one considers Brownian motion in an external field, one may describe 
the evolution of the particle's momentum p and coordinate q separately. The master equation 
(generalized Fokker-Planck equation) in this situation is called the Kramers equation [9]. The 
corresponding generalized Langevin form of the Kramers equation may be written as a formal 
continuum stochastic differential equation 

SH 

SH , , 

" = ^- 

where the stochastic variables are the so-called "white noise" terms. In eq.(I7) a fictitious 
fifth time coordinate t is introduced and H denotes a 4-dimensional Hamiltonian defined by 
the theory to be considered. The parameter 7 is a friction coefficient that can be tuned freely. 

For numerical simulations the continuous time derivatives are discretized and the time 
evolution is realized by discrete time integration schemes, using a finite time step (step size) 
Emd, like the standard leapfrog method [8]. The resulting discretized form of eq.(I7) finally 
leads to the Kramers equation algorithm which is an exact algorithm due to the introduction of 
a global accept/reject step. It was introduced for field theory simulations and tested in simpler 
models by Horowitz [I], who called it L2MC. In [2] the algorithm was tested the first time for 
Wilson QCD using SU{2) as the gauge group. It was found that it performs equally well as 
the Hybrid Monte Carlo algorithm. 

The improvements that have been used for this algorithm are even-odd preconditioning [10] 
and a better leapfrog integration scheme suggested by Sexton and Weingarten [II]. Recently 
we also started to investigate the biconjugate gradient stabilized algorithm for inverting the 
fermion matrix [12]. We found that the number of iterations to reach a given residue can 
decrease by as much as 40% of the number required in the conjugate gradient method. There 
are alternative possible improvements like the chronological extrapolation method to find better 
starting vectors for the conjugate gradient algorithm [13]. However, due to the use of large step 
sizes in the Sexton- Weingarten integration scheme, we do not expect further acceleration of 
the program in our case. The combination of the polynomial approximation of 1/s and the 
conjugate gradient inversion technique as proposed in [14] seems rather promising but we have 
not yet tested it. 

2.2 Boson algorithm 

The version of the bosonic algorithm that we use is described in detail in [5]. We took a com- 
bination of standard heatbath and over-relaxation techniques to update the gauge and scalar 



degrees of freedom. In this version several improvements have been implemented. The first is 
again even-odd preconditioning [10]. The lowest eigenvalue of the preconditioned matrix is 
larger than the corresponding eigenvalue of . As a result, in a simulation using precondition- 
ing, a smaller number of scalar field copies can be taken. This is very important not only from 
the point of view of memory requirements. The main improvement comes from the observation 
[5] that the autocorrelation time depends linearly on the number of scalar fields. Different kinds 
of mixing of heatbath and over-relaxation updates also lead to substantial improvements on 
the autocorrelation time. Finally, optimal choices of the parameter cm, see eq.(4), as proposed 
in [4] and [15] lead to further improvements. Choosing cm to be less than one raises the lowest 
eigenvalue of and therefore leads to a smaller number of scalar field copies. On the other 
hand, one has to make sure that the largest eigenvalue remains below and sufficiently far from 
one to avoid accidental slow bosonic modes in the simulation. 

In practice, the boson algorithm is run with a non- vanishing accuracy parameter (5, eq.(16). 
However, there have been several proposals to make the algorithm exact [4, 15]. 

2.3 Parameters for the test runs 

In this subsection we give explicitly the values of the algorithm parameters that have been 
used. The performance of the algorithms can react sensitively to changes of these parameters 
[2, 5]. The tunable parameters in the Kramers equation algorithm are the discrete finite step 
size Emd, the friction coefficient 7 and a repetition parameter k which determines how often the 
momenta are refreshed by generating them newly from a Gaussian distribution. We list the 
actually used parameter values in Table 1 for the three lattice sizes taken, namely &'12, 9>^12 
and IG"*. The parameter is chosen such that the acceptance rate is about 80%. 



Table 1: Technical parameters for both algorithms 





Kramers 


Boson 


Lattice 




k 


7 


e 


-^hoson ^ 


Cm 


6^12 


0.205 


3 


0.5 


0.01454 


18 2% 


0.6 


8^12 


0.185 


4 


0.5 


0.0061 


24 4% 


0.745 


16^ 


0.125 


5 


0.5 


0.0048 


44 0.38% 


0.7 



As mentioned in the introduction, the Chebyshev polynomial with roots given in eq.(13) 
approximates the function 1/s in the interval e < s < 1 with an exponential fitting error. For 
the QCD application, this interval is determined by the lowest Amin and largest A^ax eigenvalues 
of the preconditioned matrix [5] that is used in the simulations. It is therefore necessary 
that before running the boson algorithm an estimate of these two quantities is made. 

In Table 2 we list the expectation values of Amin and A^ax obtained by the conjugate gradient 
method [16]. For the boson algorithm, the parameter e should be chosen to be at the upper 
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edge of the distribution of Xmin (see fig. 3 in [5]). As mentioned in section 2.2, the parameter cm 
was chosen to be less than one. Lowering cm in Table 1 further for the 8"^12 and 16'^ lattices, 
while keeping € / {X^i^iQ'^)) and 6 fixed, does not lead to an improvement in the computational 
effort. This unexpected phenomenon was first observed in [5] and needs still to be understood. 
In order to have an agreement with results from HMC runs [4], the accuracy parameter 6 should 
be at the order of a few percent and the number of scalar fields Nj^oson as given in Table 1 has 
to be chosen accordingly. 





Table 2: Eigenvalue expectation 


values 


Lattice 


(A„,in(g')) 


(A„,ax(g')) 


6^12 


0.0115(4) 


0.9386(4) 


8^12 


0.00539(9) 


0.6100(2) 


16^ 


0.00478(3) 


0.6961(5) 



3 Results 

All results that will be presented below are obtained at /3 = 2.12 and n = 0.15. These 
parameters correspond to a pion to p mass ratio of about 0.95. Simulations at smaller mass 
ratios are very costly and a reliable determination of the autocorrelation times is difficult ^ . 
In Table 3 we give the results for several observables as obtained from both algorithms. We 
measure the plaquette expectation value (P), the pion mass and the /j-mass nip. The masses 
are determined through 

2G(i,/2) ' ' ' 

with appropriate correlation functions C{t). As was found earlier [4, 17], with the accuracy 
parameter 8 given in Table 1, the two algorithms give compatible results, although for the 
8"^ 12 lattice 8 seems to be too large to reach a real satisfactory agreement for the plaquette 
expectation value. 

We are mostly interested in the question of how much computer time is needed to reach 
a statistically independent configuration. To answer this question, one needs the speed of the 
program and most importantly the autocorrelation times. The latter quantity has been obtained 
by the "window" technique [6]. In this letter we will only compare the integrated autocorrelation 
time for the plaquette as an example. Other observables, like meson correlation functions or the 
lowest eigenvalues Amin, show similar behavior. The data sample taken was always more than 



^We actually performed runs at a pion to p-mass ratio of about 0.5 on a IG'* lattice. There the autocorrelation 
time increased substantially and could not be determined reliably. However, the rough estimates that we can 
obtain in this situation would not lead to a change of the conclusion as given below. 
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Table 3: Results for both algorithms 



Lattice Algorithm 



6^12 
8^12 
16^ 



Kramers 

Boson 

Kramers 

Boson 

Kramers 

Boson 



0.5803(2) 
0.5804(4) 
0.5777(3) 
0.5768(2) 
0.5778(1) 
0.5779(1) 



1.191(8) 

1.170(12) 

1.052(8) 

1.044(3) 

1.003(2) 

1.004(4) 



1.275(9) 

1.254(14) 

1.123(11) 

1.112(4) 

1.060(2) 

1.063(5) 



10 times the measured autocorrelation time. For the Kramers algorithm on the 6"^12 lattice and 
the boson algorithm on the 6"^ 12 and 8"^ 12 lattices, we have run several replica systems which 
provided completely independent data samples allowing for a reliable error estimation of the 
autocorrelation times. For the other lattices the total run was blocked into several sub-blocks, 
each of which again was several times larger than the measured autocorrelation time. Then the 
results from these sub-measurements were taken for the error analysis. Clearly, in this case the 
error determination is not as reliable as in the previous case, but it should nevertheless give a 
reasonable estimate of the error. We performed cross-checks on the integrated autocorrelation 
times by analyzing the exponential autocorrelation time and inspecting the blocked errors of 
the observables. We found the results from this analysis to be in agreement with the ones 
obtained by the window technique. 

The smaller 6"^12 and 8"^12 lattices have been run on the Ql version of the APF with 8 nodes. 
The larger 16"* lattice has been run on the QH2 version with 256 nodes. In the last two columns 
of Table 4 we give the autocorrelation time, T[sec], in real CPU-seconds taking the speed of the 
program into account. The subscript k (b) stands for the Kramers equation (boson) algorithm. 
These numbers provide a direct measure of how long to run each algorithm in real time to 
obtain an independent configuration on which measurements can be performed. We see that 
for the larger lattices the Kramers equation algorithm appears to be about a factor of 2 better 
than the boson algorithm. We note that on the 6"^ 12 lattice the situation is reversed which is 
presumably due to a non optimal tuning of the Kramers equation algorithm parameters. 

One may ask the question, whether the results given in Table 4 are specific for the Alenia 
Quadrics machine. We therefore tried to find a more machine independent criterion. The com- 
putationally most expensive part in the Kramers equation algorithm is the conjugate gradient 
method for the matrix inversion. This inversion, on the other hand, is dominated by matrix 
Q times vector (f> operations, denoted by Q^. Also for the bosonic algorithm similar fioating 
point operations dominate the program [4, 5]. For this reason we give the autocorrelation time 
t[(5$] in units of in columns 3 and 4. We see the same behavior as for real time in terms 
of this unit, which should give a more machine independent comparison of the performance of 
the two algorithms. 
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Table 4: Performance comparison ol the Kramers equation and the boson algorithms. 



Lattice 


1 • 

macnme size 






r 1 
n[sec\ 


r 1 
Tk[sec\ 


6^12 


Ql [8 nodes] 


14000(800) 


21000(4000) 


298(20) 


480(100) 


8^12 


Ql [8 nodes] 


36000(2250) 


17000(5000) 


1781(112) 


979(300) 


16^ 


QH2 [256 nodes] 


56000(18600) 


26000(11000) 


990(330) 


540(230) 



4 Conclusions 



Our conclusions can be summarized by inspecting Tables 3 and 4. The first Table confirms 
again [4, 17] that we now have two exact methods for simulations of lattice fermions in QCD 
which give compatible results for important observables of the lattice theory. In particular, 
the new bosonic algorithm can now provide important cross-checks for results that have been 
obtained earlier with the Hybrid Monte Carlo method. The bosonic method appears to be 
practical also on lattices of size 16"* where it needs only a moderate number of scalar field 
copies. It is certainly important that there exist two conceptually very different algorithms 
leading to the same results in the difficult area of lattice QCD simulations. 

Table 4 shows that for the physical situation we have considered here, namely m^/mp 0.95, 
and for the two versions of the algorithms that we have been testing, the boson algorithm is 
more costly in CPU-time than the Kramers equation algorithm. Taking the errors of the auto- 
correlation time into account, however, it is also seen that both algorithms perform comparably 
and are not orders of magnitude different. Given the already long history of the molecular dy- 
namics algorithms, it is more likely that new ways will be found to improve and accelerate the 
boson algorithm in the future. One attractive possibility is the reject/accept step as proposed 
first in [15]. In [14] this procedure was already tested (although for a different theory) and en- 
couraging results were reported. It remains to be seen, whether the bosonic algorithm will be 
advantageous on large lattices due to its better theoretical scaling behavior [3, 7, 17]. Another 
open question is how severe the problem with lack of reversibility [2, 12] can become for the 
molecular dynamics algorithms on large lattices. 
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